# Cross-modal Retrieval
### UniME LLaVA OneVision 7B
Publisher: DeepGlint-AI · License: MIT · Task: Multimodal Alignment · Tags: Transformers, English · Downloads: 376 · Likes: 2

UniME is a general-purpose embedding learning framework built on multimodal large language models; it substantially improves multimodal embeddings through textual discriminative knowledge distillation and hard-negative-enhanced instruction tuning.

### UniME LLaVA 1.6 7B
Publisher: DeepGlint-AI · License: MIT · Task: Image-to-Text · Tags: Transformers, English · Downloads: 188 · Likes: 3

UniME is a general embedding learning model built on a multimodal large language model, trained at 336×336 image resolution; it ranked first on the MMEB leaderboard.

### OmniEmbed V0.1
Publisher: Tevatron · License: MIT · Task: Multimodal Fusion · Downloads: 2,190 · Likes: 3

A multimodal embedding model based on Qwen2.5-Omni-7B that produces unified embedding representations for cross-lingual text, images, audio, and video.

### mmE5 MLlama 11B Instruct
Publisher: intfloat · License: MIT · Task: Multimodal Fusion · Tags: Transformers, Supports multiple languages · Downloads: 596 · Likes: 18

mmE5 is a multimodal multilingual embedding model trained from Llama-3.2-11B-Vision; it improves embedding quality through high-quality synthetic data and achieves state-of-the-art results on the MMEB benchmark.

### ConceptCLIP
Publisher: JerrryNie · License: MIT · Task: Image-to-Text · Tags: Transformers, English · Downloads: 836 · Likes: 1

ConceptCLIP is a large-scale vision-language pretraining model enhanced with medical concepts, delivering robust performance across a wide range of medical imaging modalities and tasks.

### MEXMA-SigLIP
Publisher: visheratin · License: MIT · Task: Text-to-Image · Tags: Supports multiple languages · Downloads: 137 · Likes: 3

MEXMA-SigLIP is a high-performance CLIP-style model that pairs a multilingual text encoder with an image encoder, supporting 80 languages.

### LLM2CLIP Llama 3 8B Instruct CC Finetuned
Publisher: microsoft · License: Apache-2.0 · Task: Multimodal Fusion · Downloads: 18.16k · Likes: 35

LLM2CLIP is an approach that uses large language models to extend CLIP's cross-modal capabilities, significantly improving the discriminative power of visual and text representations.

### RS-M-CLIP
Publisher: joaodaniel · License: MIT · Task: Image-to-Text · Tags: Supports multiple languages · Downloads: 248 · Likes: 1

A multilingual vision-language pretrained model for remote sensing, supporting cross-modal image-text tasks in 10 languages.

### Video LLaVA
Publisher: AnasMohamed · Task: Text-to-Image · Downloads: 194 · Likes: 0

A large-scale vision-language model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text.

### Nomic Embed Vision V1.5
Publisher: nomic-ai · License: Apache-2.0 · Task: Text-to-Image · Tags: Transformers, English · Downloads: 27.85k · Likes: 161

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1.5, enabling multimodal applications.

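Because the vision and text models share one embedding space, image-text retrieval reduces to embedding each side and comparing vectors. The sketch below follows the usage pattern on the nomic-ai model cards, but the pooling choices and the `search_query:` prefix are assumptions to double-check against the official documentation.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# Vision tower: embeds images into the shared Nomic embedding space.
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)

# Text tower: embeds queries into the same space.
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

image = Image.open("photo.jpg")  # any local image
img_inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    img_hidden = vision_model(**img_inputs).last_hidden_state
img_emb = F.normalize(img_hidden[:, 0], p=2, dim=-1)  # CLS pooling (assumed)

# "search_query: " prefix assumed from the nomic-embed-text usage notes.
txt_inputs = tokenizer(["search_query: a photo of a dog"], padding=True, return_tensors="pt")
with torch.no_grad():
    txt_hidden = text_model(**txt_inputs).last_hidden_state
txt_emb = F.normalize(txt_hidden.mean(dim=1), p=2, dim=-1)  # simplified mean pooling (assumed)

print((img_emb @ txt_emb.T).item())  # cosine similarity between image and query
```
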
### Nomic Embed Vision V1
Publisher: nomic-ai · License: Apache-2.0 · Task: Text-to-Image · Tags: Transformers, English · Downloads: 2,032 · Likes: 22

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1, enabling multimodal applications.

### CLIP ViT B 32 Vision
Publisher: Qdrant · License: MIT · Task: Image Classification · Tags: Transformers · Downloads: 10.01k · Likes: 7

An ONNX port of the vision encoder from the CLIP ViT-B/32 architecture, suitable for image classification and similarity search.

### M3D-CLIP
Publisher: GoodBaiBai88 · License: Apache-2.0 · Task: Multimodal Alignment · Tags: Transformers · Downloads: 2,962 · Likes: 9

M3D-CLIP is a CLIP model designed for 3D medical imaging, aligning vision and language through a contrastive loss.

### BLaIR RoBERTa Base
Publisher: hyp1231 · License: MIT · Task: Text Embedding · Tags: Transformers, English · Downloads: 415 · Likes: 3

BLaIR is a language model pretrained on the Amazon Reviews 2023 dataset for recommendation and retrieval scenarios; it produces strong product text representations and can predict related products.

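A minimal sketch of producing product-text embeddings with BLaIR through the standard Transformers encoder API. The checkpoint id is inferred from the catalog entry, and CLS pooling with L2 normalization is an assumption; check the model card for the official recipe.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# BLaIR is a RoBERTa-based encoder, so standard AutoModel loading applies.
tokenizer = AutoTokenizer.from_pretrained("hyp1231/blair-roberta-base")
model = AutoModel.from_pretrained("hyp1231/blair-roberta-base")

texts = [
    "wireless noise-cancelling headphones with 30-hour battery",  # product description
    "headphones good for long flights",                           # user query / review context
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# CLS pooling + L2 normalization (assumed recipe).
emb = F.normalize(hidden[:, 0], p=2, dim=-1)
print(float(emb[0] @ emb[1]))  # cosine similarity between product and query
```
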
### OWLv2 Base Patch16
Publisher: Xenova · Task: Object Detection · Tags: Transformers · Downloads: 17 · Likes: 0

OWLv2 is a vision-language pretrained model focused on object detection and localization.

### InternVL 14B 224px
Publisher: OpenGVLab · License: MIT · Task: Text-to-Image · Tags: Transformers · Downloads: 521 · Likes: 37

InternVL-14B-224px is a 14B-parameter vision-language foundation model supporting a wide range of vision-language tasks.

### LanguageBind Video Huge V1.5 FT
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 2,711 · Likes: 4

LanguageBind is a pretrained model that achieves multimodal semantic alignment through language, binding modalities such as video, audio, depth, and thermal imaging to language for cross-modal understanding and retrieval.

### ViLT Finetuned 200
Publisher: Atul8827 · License: Apache-2.0 · Task: Text-to-Image · Tags: Transformers · Downloads: 35 · Likes: 0

A vision-language model based on the ViLT architecture, fine-tuned for a specific downstream task.

### LanguageBind Audio FT
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 12.59k · Likes: 1

LanguageBind is a language-centric multimodal pretraining method that achieves semantic alignment by using language as the bridge between modalities.

### LanguageBind Video Merge
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 10.96k · Likes: 4

LanguageBind is a multimodal model that extends video-language pretraining to N modalities through language-based semantic alignment; the work was accepted at ICLR 2024.

### MetaCLIP B16 FullCC2.5B
Publisher: facebook · Task: Text-to-Image · Tags: Transformers · Downloads: 90.78k · Likes: 9

MetaCLIP applies the CLIP framework to CommonCrawl data with the goal of revealing CLIP's training-data curation method.

### MetaCLIP B32 400M
Publisher: facebook · Task: Text-to-Image · Tags: Transformers · Downloads: 135.37k · Likes: 41

The MetaCLIP base model is a vision-language model trained on CommonCrawl data to build a shared image-text embedding space.

### LanguageBind Image
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 25.71k · Likes: 11

LanguageBind is a language-centric multimodal pretraining method that uses language as the bridge between modalities to achieve semantic alignment.

### LanguageBind Depth
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 898 · Likes: 0

LanguageBind is a language-centric multimodal pretraining method that uses language as the bridge between modalities to achieve semantic alignment across video, infrared, depth, audio, and other modalities.

### LanguageBind Video
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 166 · Likes: 2

LanguageBind is a multimodal pretraining framework that extends video-language pretraining to N modalities through language-based semantic alignment; the work was accepted at ICLR 2024.

### LanguageBind Thermal
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 887 · Likes: 1

LanguageBind is a pretraining framework that achieves multimodal semantic alignment with language as the bridge, supporting joint learning of modalities such as video, infrared, depth, and audio together with language.

### FLIP Base 32
Publisher: FLIP-dataset · License: Apache-2.0 · Task: Multimodal Fusion · Tags: Transformers · Downloads: 16 · Likes: 0

A vision-language model based on the CLIP architecture, post-trained on 80 million face images.

### CLIP Giga Config Fixed
Publisher: Geonmo · License: MIT · Task: Text-to-Image · Tags: Transformers · Downloads: 109 · Likes: 1

A large CLIP model trained on the LAION-2B dataset with the ViT-bigG-14 architecture, supporting cross-modal understanding between images and text.

### CLIP ViT Base Patch32
Publisher: Xenova · Task: Text-to-Image · Tags: Transformers · Downloads: 177.13k · Likes: 8

The CLIP model developed by OpenAI, based on the Vision Transformer architecture and supporting joint understanding of images and text.

### CLIP ViT Base Patch16
Publisher: Xenova · Task: Text-to-Image · Tags: Transformers · Downloads: 32.99k · Likes: 9

OpenAI's open-source CLIP model, based on the Vision Transformer architecture and supporting cross-modal understanding of images and text.

### CLIP ViT L 14 CommonPool.XL.laion S13b B90k
Publisher: laion · License: MIT · Task: Text-to-Image · Downloads: 176 · Likes: 1

A vision-language model based on the CLIP architecture, trained on LAION data and supporting zero-shot image classification.

### CLIP ViT B 16 DataComp.L S1b B8k
Publisher: laion · License: MIT · Task: Text-to-Image · Downloads: 1,166 · Likes: 1

A zero-shot image classification model based on the CLIP architecture, trained on the DataComp dataset and supporting efficient image-text matching.

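Checkpoints published under the laion and timm namespaces in this list are typically loaded with the OpenCLIP library directly from the Hugging Face Hub. The sketch below assumes the hub id mirrors the catalog name (`laion/CLIP-ViT-B-16-DataComp.L-s1B-b8K`); verify the exact id on the model card before use.

```python
import torch
import open_clip
from PIL import Image

# Load an OpenCLIP checkpoint straight from the Hugging Face Hub
# (hub id assumed to match the catalog entry; check the model card).
hub_id = "hf-hub:laion/CLIP-ViT-B-16-DataComp.L-s1B-b8K"
model, _, preprocess = open_clip.create_model_and_transforms(hub_id)
tokenizer = open_clip.get_tokenizer(hub_id)
model.eval()

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer(["a dog", "a cat", "a satellite image"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot class probabilities for the single image
```
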
### CLIP ViT B 16 CommonPool.L.text S1b B8k
Publisher: laion · License: MIT · Task: Text-to-Image · Downloads: 58 · Likes: 0

A vision-language model based on the CLIP architecture, supporting zero-shot image classification.

### EVA Giant Patch14 Plus CLIP 224.merged2b S11b B114k
Publisher: timm · License: MIT · Task: Text-to-Image · Downloads: 1,080 · Likes: 1

EVA-Giant is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification.

### EVA02 Large Patch14 CLIP 336.merged2b S6b B61k
Publisher: timm · License: MIT · Task: Text-to-Image · Downloads: 15.78k · Likes: 0

EVA02 is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification.

### X-CLIP Large Patch14 16 Frames
Publisher: microsoft · License: MIT · Task: Text-to-Video · Tags: Transformers, English · Downloads: 678 · Likes: 3

X-CLIP extends CLIP to general video-language understanding, using contrastive learning to support video classification and video-text retrieval.

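A rough sketch of video-text matching with this checkpoint through the Transformers `XCLIPModel` and `XCLIPProcessor` classes. The checkpoint id is inferred from the catalog entry, and the random frames are stand-ins; replace them with 16 real decoded frames (e.g. via decord or PyAV).

```python
import numpy as np
import torch
from transformers import XCLIPModel, XCLIPProcessor

model_id = "microsoft/xclip-large-patch14-16-frames"  # id inferred from the catalog entry
processor = XCLIPProcessor.from_pretrained(model_id)
model = XCLIPModel.from_pretrained(model_id)

# Stand-in for real video decoding: this variant expects 16 RGB frames per clip.
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

texts = ["a person playing basketball", "a cat sleeping on a sofa"]
inputs = processor(text=texts, videos=frames, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the clip to each text query; softmax over the text candidates.
probs = outputs.logits_per_video.softmax(dim=-1)
print(probs)
```
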
### CLIP ViT Large Patch14 336
Publisher: openai · Task: Text-to-Image · Tags: Transformers · Downloads: 5.9M · Likes: 241

A large-scale vision-language pretrained model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text.

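As a minimal illustration of the image-text matching these CLIP checkpoints provide, the sketch below scores one image against a few captions with the Transformers `CLIPModel` and `CLIPProcessor` API, using the `openai/clip-vit-large-patch14-336` checkpoint listed above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # any local image
captions = ["a photo of a dog", "a photo of a cat", "a diagram of a network"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption; softmax gives zero-shot probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```
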
### Mengzi Oscar Base
Publisher: Langboat · License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, Chinese · Downloads: 20 · Likes: 5

A Chinese multimodal pretraining model built on the Oscar framework, initialized from the Mengzi-BERT base model and trained on 3.7 million image-text pairs.

### M BERT Base ViT B
Publisher: M-CLIP · Task: Multimodal Alignment · Downloads: 3,376 · Likes: 12

A multilingual CLIP text encoder fine-tuned from BERT-base-multilingual and aligned with the CLIP visual encoder, covering 69 languages.